Figure 1: Sample outcomes of our scheme: background $c(x) = 0$ (gray) and foreground layers $c(x) = 1$, $c(x) = 2$, $c(x) = 3$
indicated by □, □, ■ respectively. On the far right, our algorithm correctly infers that the bag strap is in front of the woman’s
arm, which is in front of her trunk, which is in front of the background. Project page: http://vision.ucla.edu/cvos/


Abstract 

Occlusion relations inform the partition of the image domain
into “objects” but are difficult to determine from a single
image or short-baseline video. We show how long-term
occlusion relations can be robustly inferred from video, and
used within a convex optimization framework to segment the
image domain into regions. We highlight the challenges in
determining these occluder/occluded relations and ensuring
regions remain temporally consistent, propose strategies to
overcome them, and introduce an efficient numerical scheme
to perform the partition directly on the pixel grid, without
the need for superpixelization or other preprocessing steps.

1.  Introduction 

Partitioning the image domain into regions that correspond
to “objects” is elusive absent an explicit definition of
objects that has a measurable correlate in the image. Gestalt
principles [33] provide grouping criteria: continuity, regularity,
proximity, compactness, the last of which (figure/ground,
or occlusion) is best informed by video. Occlusions have
been used extensively for grouping [32, 5, 8, 3]. A feature of
[3] is that grouping is obtained via a linear program: local
ordering constraints provided by occluder/occluded relations
are integrated to globally partition the image domain into
depth layers. The challenge is that errors in determining
occlusion relations can have a cascading effect.

Occlusions are usually detected from the residual of optical
flow, but even assuming this detection is correct, occluder
relations are non-trivial to determine. As we show in
Fig. 2, correct determination of the occluder requires either
knowledge of the motion of the occluded region (which is
undefined), or knowledge of its partition into regions. Hence
the conundrum: to determine occlusion relations, so that
objects can be segmented, we need to know the objects in
the first place. The first contribution of our work is to break
the conundrum by leveraging motion and appearance priors
to hallucinate motion in the occluded region. With the
occluder/occluded relations we can obtain a depth-layer
partition for the image domain. In video, however, nuisances
such as motion blur, quantization, scale, and lack of motion
can cause layer segmentation to fail. Thus, the second
contribution is a causal framework for integrating occlusion cues
exploiting temporal consistency priors to partition the video
into depth layers. Our third contribution is to make the
solution of the resulting optimization problem efficient using a
primal-dual scheme. Our proposed method is competitive with
state-of-the-art approaches qualitatively in visual boundaries
and quantitatively in numerical benchmarks, while processing
video sequences causally, rather than in batch. Samples
from our scheme are shown in Fig. 1.

The  paper  is  organized  as  follows:  we  set  up  our  problem 
in  Sec.  2.  We  describe  our  first  contribution  in  determining 
occluder  relations  in  Sec.  2.1  and  how  we  leverage  prior 
work  [3]  in  Sec.  2.2.  Sec.  3  explores  how  we  causally 
integrate  cues  to  construct  priors  for  foreground  regions  in 
Sec.  3.1,  obtain  persistent  object  boundaries  in  Sec.  3.2,  and 
aggregate  occluder  relations  in  Sec.  3.3.  Our  final  model 
is  presented  in  Sec.  3.4.  Implementation  and  optimization 
details  are  covered  in  Sec.  4-5,  including  our  approach  for 
hallucinating  motion  in  the  occluded  regions  in  Sec.  4.2. 
Empirical  evaluation  appears  in  Sec.  6,  where  we  show  that 
the  typical  failure  modes  of  prior  approaches  stemming  from 
unreliable  occlusion  relations  are  mitigated. 
1.1.  Related  Work 




A large number of methods have been proposed for partitioning
a video sequence into non-overlapping regions with
unique labels, using motion, appearance or their combination
[25, 7, 18, 31, 22, 11, 34, 16, 36, 24, 23]. These approaches
are susceptible to oversegmentation, which video object
segmentation attempts to mitigate by assigning a single label to
each object. The problem can be cast as multi-label
classification, in which a unique label is attached to each object [23],
or as binary “foreground”/“background” (FG/BG)
classification [12, 22, 36, 24]. While our work produces depth layers,
and not object labels, these could be added post-mortem.

Many approaches operate offline (or non-causally), with
the entire video available for processing [23, 22, 36, 24],
which scales poorly with sequence length, although “streaming”
approaches can be used [31, 34]. Our approach is online
(or causal), and is closely related to tracking [25, 4, 10],
which, unlike us, requires manual initialization.

Estimation of segmentation masks, motion, and depth
ordering can be formulated jointly [13, 32, 5, 20, 27, 19,
21, 29, 26, 10, 35], but the resulting problem is nonconvex
and requires a substantial computational effort. We separate
motion estimation from segmentation and depth ordering,
and focus on the latter, which makes a scalable convex
formulation possible.

2.  Video  segmentation  with  layers 

Let $I_t$ be an image of a video defined
on the domain $D \subset \mathbb{R}^2$. We seek to partition $D$ into regions,
each associated with an integer depth order, represented by
a function $c_t : D \to \mathbb{Z}_+$ indicating to which layer each
pixel belongs. A layer is then $c_t^{-1}(i) = \{x \in D \mid c_t(x) = i\}$,
where $c_t(x) = 0$ denotes the background and larger values
of $c_t$ indicate “foreground” regions $c_t(x) = 1, 2, 3, \ldots$. The
connected  components  of  non-zero  regions  correspond  to 
individual  objects.  It  was  shown  by  [5,  3]  that  depth  layers 
can  be  inferred  from  occlusion  phenomena,  that  occur  as  a 
result  of  object  or  viewer  motion,  causing  parts  of  the  scene 
to  become  hidden  and  others  revealed.  These  inform  local 
order  relations  between  surfaces  in  the  scene:  when  a  surface 
becomes  occluded,  the  image  region  where  it  projected  to 
becomes  occupied  by  the  occluder,  which  is  therefore  closer 
to  the  viewer.  These  occluder-occluded  relationships  can 
be  used  as  cues  for  segmenting  regions  in  the  image  that 
back-project  to  distinct  objects  in  the  scene. 
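As a small illustration of the statement that connected components of the non-zero layers correspond to individual objects, the following sketch labels such components on a toy layer map. The function name `label_objects` and the choice of 4-connectivity are assumptions made for this example; the text does not specify a connectivity.

```python
from collections import deque


def label_objects(layers):
    """Label 4-connected components of pixels with layer value > 0.

    `layers` is a list of rows of non-negative integers (0 = background).
    Returns (labels, count), where labels uses 0 for background and
    1..count for the components found in raster-scan order.
    """
    height, width = len(layers), len(layers[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for row in range(height):
        for col in range(width):
            if layers[row][col] > 0 and labels[row][col] == 0:
                count += 1
                labels[row][col] = count
                queue = deque([(row, col)])
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and layers[ny][nx] > 0 and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


toy_layers = [
    [0, 1, 0, 2],
    [0, 1, 0, 2],
    [0, 0, 0, 0],
]
toy_labels, toy_count = label_objects(toy_layers)
```

Here the two foreground columns are separated by background, so they receive distinct object labels even though only their layer values, not object identities, are stored in the map.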

2.1. The “occluder” and the “occluded”

Under the assumptions of Lambertian reflection, constant
illumination, and co-visibility typically implicit in most
optical flow algorithms, $I_t(x)$ is related to $I_{t+1}(x)$ by the
brightness-constancy equation

$$I_t(x) = I_{t+1}(w_t^{t+1}(x)) + n_t(x), \quad x \in D \setminus \Omega_t^{t+1}, \qquad (1)$$


Figure 2: Initial (top) and final (middle) views of a smaller
square sliding under a larger one, producing an occluded
region $\Omega$ in red (subscripts dropped for readability). Two
alternate hypotheses (left and right) for the occluder ($\mho$)
in yellow produce different constraints (bottom). Left: $\Omega$
moves with E and slides under $A \cup B$. Right: the occluder is
split in two, with B occluding C and E occluding D. Disambiguation
requires either knowledge of the motion in $\Omega$, which is
undetermined as it is occluded, or the object segmentation,
which is the final goal.

where $w_t^{t+1}$ is the deformation field that warps the domain
of $I_t$ into that of $I_{t+1}$ and $n_t$ lumps together all un-modeled
phenomena and violations of the assumptions. Often, $w_t^{t+1}$ is
represented by the optical flow field $v_t^{t+1}$ via $w_t^{t+1}(x) =
x + v_t^{t+1}(x)$. The above holds on the entire image domain
except in the occluded regions $\Omega_t^{t+1}$, where surfaces visible
at time $t$ are no longer visible at $t + 1$. In this region, the
optical flow is not defined, but can be extrapolated from the
“co-visible” regions via regularization. Occluded regions are
easy to find as a byproduct of optical flow estimation [2], as
they yield a large residual $n_t$ via backward flow. What is not
easy to find is the occluder.
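To make the idea of detecting occlusions from the brightness-constancy residual concrete, here is a minimal one-dimensional sketch. The function names, the integer-valued flow (which avoids interpolation), and the fixed threshold are assumptions made for illustration; they are not the detector of [2].

```python
def warp_residual(frame_t, frame_t1, flow):
    """Residual |I_t(x) - I_{t+1}(x + v(x))| for a 1-D image.

    flow[x] is an integer displacement (an assumption of this sketch).
    Pixels warped outside the image receive an infinite residual.
    """
    residual = []
    for x, value in enumerate(frame_t):
        target = x + flow[x]
        if 0 <= target < len(frame_t1):
            residual.append(abs(value - frame_t1[target]))
        else:
            residual.append(float("inf"))
    return residual


def detect_occlusions(frame_t, frame_t1, flow, threshold):
    """Flag pixels whose residual exceeds the (hypothetical) threshold."""
    return [r > threshold for r in warp_residual(frame_t, frame_t1, flow)]


# A bright object (value 9) moves one pixel to the right over a static,
# textured background; background pixel 2 is covered at time t + 1.
frame_t = [9, 9, 3, 4, 5, 6]
frame_t1 = [1, 9, 9, 4, 5, 6]
flow = [1, 1, 0, 0, 0, 0]  # flow at pixel 2 is only an extrapolated guess
occluded = detect_occlusions(frame_t, frame_t1, flow, threshold=1)
```

Pixel 2 is the only one with no match in the next frame, so it is the only one flagged. The residual says nothing about which visible region did the occluding, which is the point made in the text.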

The  defining  characteristic  of  the  occluder  point  G 
(the  occluder  region)  corresponding  to  the  occluded 
point  yt  e  (the  occluded  region)  is 

=  (2) 

This equation is somewhat unintuitive as the left-hand side
lives in the domain of the image at time $t + 1$ whereas
the  right-hand  side  is  defined  only  at  time  t.  This  can  be 
interpreted  as 

yi  =  wU,iyt),  (3) 

which is completely agnostic of the motion of the occluded
region. 

Consider Fig. 2: The occluded region, $C \cup D$, could
slide under the larger rectangle, and become occluded by
$A \cup B$. However, C and D could also actually correspond to
different objects, and move independently. In this case, B
could be the occluder of C and E could be the occluder of D.
To  disambiguate  between  these  two  hypotheses,  we  need  to 
know  either  the  motion  of  D,  which  is  not  possible  since  it 


is  occluded,  or  the  object  partition,  which  is  our  goal  in  the 
first  place.  In  the  example  in  question,  using  (2)  would  favor 
the  hypothesis  of  B  occluding  C  and  E  occluding  D  (right 
half  of  Fig.  2).  This  would  yield  two  ordering  constraints, 
c(B)  >  c(C)  and  c(E)  >  c(D)  that  hinge  on  the  occluded 
region  and  impose  no  constraints  between  the  visible  regions 
B  and  E.  The  latter  constraint  is  also  incorrect  in  the  example 
(Fig.  2  bottom  right). 

However,  while  the  motion  in  the  occluded  region  is  not 
determined,  it  can  be  hallucinated  exploiting  regularization 
priors.  Even  with  a  coarse  estimate  of  the  motion  of  D,  we 
could  determine  if  it  moves  similarly  to  E,  in  which  case  it 
cannot  be  occluded  by  it  and  must  instead  be  occluded  by 
B.  Therefore,  in  our  approach  we  extrapolate  motion  to  the 
occluded  region,  so  as  to  attribute  it  to  a  possible  occluder. 
In  Sec.  4.2,  we  discuss  how  to  exploit  natural  image  and 
motion  priors  to  achieve  this.  Of  course,  one  could  resort  to 
such  priors  and  photometric  characteristics  of  the  occluded 
region  to  directly  determine  the  grouping  of  C,  D,  and  E. 
But  again  if  this  was  easy,  we  would  have  already  solved  the 
problem  of  object  segmentation. 

2.2.  From  local  ordering  constraints  to  layers 

In [3], the following convex model for inferring $c_t$ from
occlusion cues was proposed:

$$c_t = \arg\min_{c_t \geq 0} \int_D g_t(x) |\nabla c_t(x)| \, dx \quad \text{s.t.} \quad c_t(y^c) - c_t(y) \geq 1 \;\; \forall (y^c, y) \in O_t. \qquad (4)$$

$O_t$ denotes the set of occlusion cues composed of pairs
$(y^c, y)$, where $y^c$ lies on the occluding surface, and $y$ lies on
the surface that was (will be) occluded in the previous (next)
frame. The objective $\int_D g_t(x) |\nabla c_t(x)| \, dx$ is just weighted
total variation (TV), with the data-dependent affinity weights
(denoted by $g_t(x)$) being small at image and motion boundaries
and large otherwise. Note that the “data-term” enters
the optimization as a set of constraints which require
occluded-occluder pairs to lie in different layers: specifically,
the occluder must lie in the layer closer to the viewer (higher
values of $c_t$). An overview of this approach is shown in
Fig. 3. While this optimization problem relaxes the integer
constraint ($c_t : D \to \mathbb{Z}_+$), empirically the solutions are
piecewise constant and integer valued.

Although this model is formulated for a single time instant
$t$, three frames ($t - 1$, $t$, $t + 1$) are necessary to obtain
occlusion cues. However, they are typically not sufficient
when small inter-frame motion produces unreliable occlusion
constraints. Next, we exploit temporal persistence to
overcome this problem.

3.  Incorporating  motion  cues  causally 

Figure 3: Left: The motion of two objects generates occlusions
and disocclusions (both denoted by $\Omega$, shown in red).
Middle: each occluded region is attributed to a local occluder
($\mho$, shown in yellow). Occluder-occluded relationships
constrain the objects’ depth order. Right: resulting depth layers.

Our causal framework leverages a rich history of image
frames, the segmentation cues from those frames (occlusions
and weights), and previous layer estimates to facilitate
segmentation in the current frame. Large-displacement
propagation of these cues via $w_t^{t-1}$ is unstable, rendering cues
unusable. But when the motion becomes large, occlusions
become easier to detect, making the past unnecessary for
segmentation. Thus, these cues are complementary: when
the motion is large, sufficient occlusion cues are produced,
and $w_t^{t-1}$ is erroneous. When the motion is small, occlusion
cues are few, but propagation is reliable. This motivates an
adaptive integration of cues based on motion. A weighting
function $m_t(x) = \alpha \exp(-|v_t^{t-1}(x)| / \mu_v)$ is used, where
$\alpha \in [0, 1]$, $v_t^{t-1}$ is the optical flow, and $\mu_v$ is the mean value
of $v$ for this frame. The weight decreases with large motion,
regardless of how long ago it occurred. The following sections
describe the temporal cues leveraged in our framework.
Note that the variable being optimized over is always $c_t$, and
$c_{t-1}$ is always available as a result of previous optimization.
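The motion-adaptive weight can be sketched in a few lines. The function name `motion_weight`, the default `alpha`, and the handling of an all-zero flow field are assumptions for this example; the text only states that the weight has the form $\alpha \exp(-|v|/\mu_v)$ with $\alpha \in [0, 1]$ and $\mu_v$ the per-frame mean.

```python
import math


def motion_weight(flow_magnitudes, alpha=0.9):
    """Per-pixel weight alpha * exp(-|v| / mean|v|) for one frame.

    `flow_magnitudes` holds |v(x)| for every pixel. When the whole frame is
    static the mean is zero; this sketch then returns alpha everywhere,
    which is an assumption not covered by the text.
    """
    mean_magnitude = sum(flow_magnitudes) / len(flow_magnitudes)
    if mean_magnitude == 0:
        return [alpha] * len(flow_magnitudes)
    return [alpha * math.exp(-m / mean_magnitude) for m in flow_magnitudes]


weights = motion_weight([0.0, 2.0])  # mean magnitude is 1.0
```

A static pixel keeps the full weight `alpha`, while a pixel moving at twice the mean speed is down-weighted by a factor of $e^{-2}$, so cues propagated across large displacements are trusted less.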

3.1.  Once  an  object,  always  an  object 

Layer  values  Ct{x)  are  not  constant  over  time,  as  objects 
can  move  in  front  of  one  another  and  switch  order  of  their 
distance  to  the  viewer.  However,  once  an  object  is  detected, 
it should not later be labeled as background, even if it stops
moving  and  produces  no  occlusion  cues  for  segmentation. 

This can be enforced causally using the prior segmentation
result ($c_{t-1}$) via a (convex) constraint:

$$c_t(x) \geq 1 \;\; \forall x \in F, \quad F = \{x \mid c_{t-1}(w_t^{t-1}(x)) \geq 1\} \qquad (5)$$

where $F$ is the indicator of the previous frame’s foreground
region warped into the current frame. To mitigate errors in
prior segmentations, we relax the constraint and penalize
violations with a hinge loss:

$$\int_D \kappa_t(x) \max(0, 1 - c_t(x)) \, dx \qquad (6)$$

with $\kappa_t$ being the cost of violating the constraint. Choosing
$\kappa_t(x) = 0$ for $x$ outside $F$ allows us to write the penalty as
an integral over the entire image domain $D$. As $\kappa_t(x) \to \infty$ for
$x \in F$, the hard constraint (5) is recovered.


Figure 4: $c_{t-1}$ (column 1) is used to compute the foreground
prior ($\kappa_t$) (column 2). Without $\kappa_t$, the resulting $c_t$
completely misses the objects (column 3); with $\kappa_t$, $c_t$
succeeds (last column). Note that $c_{t-1}$ and $c_t$ look very similar:
$\kappa_t$ helps most during small-baseline motion when occlusion
cues are weak but $c_{t-1}$ easily predicts $c_t$.

The cost of violating the constraint is computed recursively,
with initial condition $\kappa_1(x) = 0$, as

$$\kappa_t(x) = m_t(x) \kappa_{t-1}(w_t^{t-1}(x)) + \mathbb{1}\{c_{t-1}(w_t^{t-1}(x)) \geq 1\}$$

where $\mathbb{1}$ is a characteristic function ($\mathbb{1}\{X\} = 1$ if $X$ is
true, and is 0 otherwise). This foreground prior boosts $\kappa_t$
wherever the corresponding points are labeled as foreground
in the previous frame and diminishes it over time and motion
as described above. As demonstrated in Fig. 4, whenever
motion is small, instantaneous occlusion cues are insufficient
to perform segmentation, and this notion of temporal
consistency is helpful.
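The recursive foreground prior can be sketched on a one-dimensional pixel row, assuming the recursion has the form $\kappa_t(x) = m_t(x)\,\kappa_{t-1}(w(x)) + \mathbb{1}\{c_{t-1}(w(x)) \geq 1\}$ with $w$ the backward warp. The function name, the integer warp, and the treatment of pixels warped outside the image are assumptions of this sketch.

```python
def update_foreground_prior(kappa_prev, layers_prev, backward_warp, weights):
    """One step of the assumed recursion for the foreground prior.

    backward_warp[x] is the (integer) pixel in the previous frame that
    pixel x of the current frame came from. Pixels whose source falls
    outside the image get a prior of zero (an assumption of this sketch).
    """
    kappa = []
    for x, source in enumerate(backward_warp):
        inside = 0 <= source < len(kappa_prev)
        carried = weights[x] * kappa_prev[source] if inside else 0.0
        boost = 1.0 if inside and layers_prev[source] >= 1 else 0.0
        kappa.append(carried + boost)
    return kappa


identity = [0, 1, 2]
half = [0.5, 0.5, 0.5]
# Frame 2: pixels 1 and 2 were foreground in frame 1.
kappa_2 = update_foreground_prior([0.0, 0.0, 0.0], [0, 1, 1], identity, half)
# Frame 3: pixel 1 was lost by the segmentation of frame 2, pixel 2 was kept.
kappa_3 = update_foreground_prior(kappa_2, [0, 0, 1], identity, half)
```

Pixel 1 keeps a decayed but non-zero prior after a single missed detection, while pixel 2 accumulates evidence; this is the behavior that lets a stopped object remain in the foreground.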

To avoid the entire image domain from becoming foreground,
we introduce an additional regularization penalizing
layer values

$$\tau \int_D c_t(x) \, dx. \qquad (7)$$

This is similar to the regularization used in [3], although
they use the $\ell_\infty$ norm, whereas here we use $\ell_1$. This term
encourages pixels to lie in the background layer, unless
sufficient evidence pushes them into the foreground.

3.2.  Persistent  layer  boundaries 

Figure 5: Occlusion cues from the current frame alone ($O_t$),
with occluded points ($\Omega$) in red and occluder points ($\mho$) in
yellow (column 1), fail to segment the objects (column 2).
However, aggregating constraints over time ($\bar{O}_t$) (column 3)
successfully recovers all of them (last column).

While depth-layer values are not persistent, their boundaries
are. Unless objects split or merge, we have

$$\mathbb{1}\{\nabla c_t(x) \neq 0\} = \mathbb{1}\{\nabla c_{t-1}(w_t^{t-1}(x)) \neq 0\}. \qquad (8)$$

This is a nonconvex constraint. However, enforcing
$\nabla c_t(x) = 0$ wherever $\nabla c_{t-1}(w_t^{t-1}(x)) = 0$ is simple (a linear
constraint), and its relaxed version with a hinge loss and
associated cost $u_t(x)$ is equivalent to increasing weights in
TV regularization (shown in appendix). This leaves the hard
part: enforcing $\nabla c_t(x) \neq 0$ wherever $\nabla c_{t-1}(w_t^{t-1}(x)) \neq 0$.
To remain within a convex optimization framework, we treat
this as a bias and set the corresponding $u_t(x)$ to be negative,
which decreases the corresponding TV weights (which are
kept nonnegative to preserve convexity). This layer unity
prior is also computed recursively, with $u_1(x) = 0$, as

$$u_t(x) = m_t(x) u_{t-1}(w_t^{t-1}(x)) + \mathbb{1}\{\nabla c_{t-1}(w_t^{t-1}(x)) = 0\} - \mathbb{1}\{\nabla c_{t-1}(w_t^{t-1}(x)) \neq 0\}.$$

We also perform temporal aggregation of the TV affinity
weights. In each frame, we compute the boundary strength
$\rho_t(x) \in \mathbb{R}_+$, as described in Sec. 4. The aggregated boundary
strength $\bar{\rho}_t(x)$ is (with $\bar{\rho}_1(x) = 0$)

$$\bar{\rho}_t(x) = m_t(x) \bar{\rho}_{t-1}(w_t^{t-1}(x)) + \rho_t(x). \qquad (9)$$

The aggregated TV weights used in the optimization are

$$g_t(x) = \max(0, 1 - \bar{\rho}_t(x) + u_t(x)). \qquad (10)$$
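The aggregation of boundary strength and the resulting TV weights can be sketched as follows, assuming the recursion $\bar{\rho}_t(x) = m_t(x)\,\bar{\rho}_{t-1}(w(x)) + \rho_t(x)$ followed by $g_t(x) = \max(0, 1 - \bar{\rho}_t(x) + u_t(x))$. The function name, the bar notation for the aggregated quantity, the integer warp, and the example values are assumptions of this sketch.

```python
def aggregate_tv_weights(rho_bar_prev, rho_now, unity, backward_warp, weights):
    """Return (aggregated boundary strength, TV weights) for one frame.

    rho_bar_prev: aggregated strength of the previous frame.
    rho_now:      boundary strength measured in the current frame.
    unity:        layer-unity prior u_t (positive raises, negative lowers g).
    Sources outside the image contribute nothing (an assumption).
    """
    rho_bar = []
    for x, source in enumerate(backward_warp):
        inside = 0 <= source < len(rho_bar_prev)
        carried = weights[x] * rho_bar_prev[source] if inside else 0.0
        rho_bar.append(carried + rho_now[x])
    tv_weights = [max(0.0, 1.0 - rb + u) for rb, u in zip(rho_bar, unity)]
    return rho_bar, tv_weights


rho_bar, g = aggregate_tv_weights(
    rho_bar_prev=[0.0, 0.8],
    rho_now=[0.1, 0.6],
    unity=[0.0, 0.0],
    backward_warp=[0, 1],
    weights=[0.5, 0.5],
)
```

The second pixel has repeatedly shown boundary evidence, so its TV weight drops to zero and a layer boundary there costs nothing, while the first pixel stays expensive to cut.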

3.3.  Occlusion  cue  aggregation 

Instantaneous occlusion constraints ($O_t$) are accumulated
into the aggregated constraint set $\bar{O}_t = w_{t-1}^{t}(\bar{O}_{t-1}) \cup O_t$,
where past constraints $\bar{O}_{t-1}$ are propagated to the current
frame by the motion of the occluder, $w_{t-1}^{t}(y^c)$ (see Fig. 5).
The base condition is $\bar{O}_1 = O_1$. The constraint penalty
weights $\lambda$, computed as in Sec. 4.1, are adjusted over time by

3.4.  Overall  model 

The final model that incorporates occlusion cues, weights,
foreground and unity priors is

$$c_t = \arg\min_{c_t \geq 0} \int_D g_t(x) |\nabla c_t(x)| \, dx + \tau \int_D c_t(x) \, dx + \int_D \kappa_t(x) \max(0, 1 - c_t(x)) \, dx + \sum_{i=1}^{N} \lambda_i \max(0, 1 - c_t(y_i^c) + c_t(y_i)), \qquad (11)$$

where the first term (weighted TV) ensures that the result
is piecewise constant, the second (model selection) term
prevents the creation of spurious layers, the third term
(foreground prior) encourages regions to have nonzero layer
values wherever $\kappa_t(x)$ is large, and the fourth is the penalty for
violating the occlusion constraints.
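To see how the four terms trade off, the following sketch evaluates a discrete one-dimensional version of the objective for a given labeling. The function name, the encoding of cues as `(occluder_index, occluded_index, lam)` triples, and the toy numbers are assumptions for illustration; the hinge on each cue is written as $\max(0, 1 - c(y^c) + c(y))$, the relaxation of the constraint $c(y^c) - c(y) \geq 1$.

```python
def layer_energy(c, g, tau, kappa, cues):
    """Discrete 1-D energy: weighted TV + tau * sum(c) + foreground hinge
    + occlusion hinge. g has one weight per pair of neighboring pixels."""
    tv = sum(g[i] * abs(c[i + 1] - c[i]) for i in range(len(c) - 1))
    model_selection = tau * sum(c)
    foreground = sum(k * max(0.0, 1.0 - ci) for k, ci in zip(kappa, c))
    occlusion = sum(lam * max(0.0, 1.0 - c[a] + c[b]) for a, b, lam in cues)
    return tv + model_selection + foreground + occlusion


g = [1.0, 0.1, 1.0]        # weak affinity between pixels 1 and 2 (a boundary)
kappa = [0.0, 0.0, 0.0, 0.0]
cues = [(2, 1, 10.0)]      # pixel 2 occludes pixel 1
flat = layer_energy([0, 0, 0, 0], g, 0.05, kappa, cues)
layered = layer_energy([0, 0, 1, 1], g, 0.05, kappa, cues)
```

The flat labeling pays the full occlusion penalty, whereas placing the occluder one layer up costs only a cheap cut at the boundary plus the small model-selection term.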


4.  Implementation  details 

For each frame, we incorporate appearance, edge, and
motion information into the weights $\rho_t(x)$ in (9) as follows:

$$\rho_t(x) = 1 - \bigl(\beta_I h(|\nabla I(x)|) + \beta_E h(E(x)) + \beta_v h(|\nabla v_t^{t+1}(x)|)\bigr)$$

where $h(x) = \exp(-x / \mu_x)$ and $\mu_x$ is the average value of
$x$. $E(x) \in [0, 1]$ is the output of an edge detector [14]
with $E(x) \to 1$ at the boundaries. In our experiments,
$(\beta_I, \beta_E, \beta_v) = (0.2, 0.4, 0.4)$. Following [24], we also adjust
the motion term by the difference in flow angles at the
pixels where flow magnitude is small.


Figure 6: Cross bilateral filtering extrapolates flow in $\Omega$ via
motion and appearance priors, facilitating reliable occluder
determination. Left: The extrapolated motion field.
Boxes highlight occluded regions where notable change (often
improvement) occurs. Right: For each box, the extrapolated
flow (top) and $v_t^{t+1}$ (bottom) are shown.


4.1.  Occlusion  constraint  weights 

Often the occluded and occluding surfaces differ in appearance
and motion, and are separated by a strong image boundary,
suggesting that $\lambda$ be computed in a fashion similar to the
weights of Sec. 4:

$$\lambda_i = \eta_i \bigl(1 - (\beta_I h(|I(y_i^c) - I(y_i)|) + \beta_E h(E(y_i^c, y_i)) + \beta_v h(|v(y_i^c) - v(y_i)|))\bigr)$$

where the gradient operator is replaced by a difference between
appearance, edge, and motion statistics of $y^c$ and $y$.
Here, $E(x)$ is replaced by $E(x_1, x_2)$, the strongest edge
response on the line connecting $y^c$ and $y$. We additionally
validate $y^c$ and $y$ as an occluder-occluded pair with weight $\eta$,
which measures the degree to which $y$ and $y^c$ move toward
each other. Indeed, unless they do so, $y^c$ cannot take the
place of $y$, i.e. when $\eta(y^c, y)$ in

$$\eta(y^c, y) = \max(0, 1 - \exp(-\theta \Delta(y^c, y))) \qquad (12)$$

is small, then $y^c$ is less likely to occlude $y$. We choose
$\theta = 2$ so that $\Delta(y^c, y) = 1$ yields a high score. $\lambda \to 1$
whenever the appearance and motion of $y^c$ and $y$ are “different”
and the points are moving toward each other. Finally,
assuming that the occluded and occluding surfaces differ
in appearance, we can locally perturb constraints with the
goal of correcting them; this procedure is described in the
appendix. Altogether, these factors alter the constraints to
help us discount potentially erroneous cues, which occur due
to inevitable errors in optical flow and occlusion estimation.

4.2.  Flow  extrapolation  over  the  occlusion  region 

As noted in Sec. 2.1, for $x \in \Omega_t^{t+1}$, $v_t^{t+1}(x)$ is undefined
(1) and filled in by the regularizer, which corresponds to
enforcing priors on motion. The simplest priors rely solely
on continuity, tending to smooth motion boundaries, while
more sophisticated ones attempt to preserve them. We use
the cross-bilateral filter [15] to enforce such priors on $v_t^{t+1}$
in the occluded regions based on the backward flow:

$$\hat{v}_t^{t+1}(z) = \frac{1}{\nu_z} \int_D v_t^{t+1}(x) \, P(x \notin \Omega_t^{t+1}) \, \mathcal{G}(v_t^{t-1}(x) - v_t^{t-1}(z), \sigma_v) \, \mathcal{G}(x - z, \sigma_x) \, dx, \qquad (13)$$

where $\hat{v}_t^{t+1}$ is the extrapolated forward flow, $P(x \notin \Omega_t^{t+1})$
is the probability of $x$ being visible, $\mathcal{G}$ is the Gaussian kernel
$\mathcal{G}(x, \sigma) = \exp(-\|x\|^2 / 2\sigma^2)$, and $\nu_z$ is a normalization
term. We can filter the backward flow $v_t^{t-1}$ by exchanging
$t + 1$ with $t - 1$ and vice versa. Extrapolating flow is key
to determining the occluder (Fig. 2), but cannot be proven
“correct” as it hinges critically on the choice of prior. Optical flow is
computed using publicly available code [28] (“classic-nl”),
and occlusions are computed by thresholding the residual
image. See appendix for further details.
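The extrapolation step can be sketched on a one-dimensional row of scalar flows. This is a simplified reading of the cross-bilateral filter, assuming the range kernel compares backward flows and the spatial kernel compares pixel positions; the function names, the bandwidths, the 0.5 visibility cutoff, and the toy data are assumptions of this sketch.

```python
import math


def gaussian(difference, sigma):
    """Unnormalized Gaussian kernel exp(-d^2 / (2 sigma^2))."""
    return math.exp(-(difference * difference) / (2.0 * sigma * sigma))


def extrapolate_flow(forward, backward, visible, sigma_v=1.0, sigma_x=2.0):
    """Replace forward flow at occluded pixels by a weighted average of the
    forward flow of visible pixels that are nearby and share a similar
    backward flow. visible[x] is the probability that x is not occluded."""
    result = []
    for z in range(len(forward)):
        if visible[z] >= 0.5:          # keep the measured flow where visible
            result.append(forward[z])
            continue
        numerator = denominator = 0.0
        for x in range(len(forward)):
            weight = (visible[x]
                      * gaussian(backward[x] - backward[z], sigma_v)
                      * gaussian(x - z, sigma_x))
            numerator += weight * forward[x]
            denominator += weight
        result.append(numerator / denominator if denominator > 0 else forward[z])
    return result


# Pixels 0-2: object moving right; pixels 3-5: static background.
# Pixel 3 is about to be covered; its regularized forward flow (1.0) is
# a blend of both motions, but its backward flow matches the background.
forward = [2.0, 2.0, 2.0, 1.0, 0.0, 0.0]
backward = [-2.0, -2.0, -2.0, 0.0, 0.0, 0.0]
visible = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0]
extrapolated = extrapolate_flow(forward, backward, visible)
```

After filtering, the occluded pixel's flow is pulled toward the static background it moved with in the past, so it would not be attributed to the moving object as its occluded part.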

4.3.  Foreground  prior  region 

In practice, motion estimation makes mistakes near object
boundaries (e.g. occluded regions). When computing $\kappa_t$ we
first warp $\kappa_{t-1}$ to the current frame and then use morphological
operations to erode the edges proportionally to the
magnitude of the flow in that region. This ensures the prior
does not leak outside of the object regions, but produces
a poor estimate near the boundaries. To help recover the
structure of these edges, we incorporate a set of local shape
classifiers as in [4] to better capture and predict the shape of
the object boundary, the details of which are in the appendix.

5.  Optimization 

The optimization problem (11) is convex but large enough
that  off-the-shelf  methods  cannot  solve  it  without  resorting  to 
superpixels  or  other  pre-processing  to  reduce  its  dimension. 
Here  we  present  an  efficient  numerical  primal-dual  scheme 
based  on  [9]  that  allows  us  to  solve  it  on  the  pixel  grid. 

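Algorithm 1 Layer Solver

Initialize: Pick $\sigma_y, \sigma_c > 0$ satisfying the step-size condition of [9], and $\theta \in [0, 1]$.
Arbitrarily initialize feasible $c^0, y_1^0, y_2^0$. Set $\bar{c}^0 = c^0$.

Perform iterates for $k = 0, 1, 2, \ldots$:

$y_1^{k+1} = \mathrm{prox}_{\sigma_y F_1^*}(y_1^k + \sigma_y \nabla \bar{c}^k)$

$y_2^{k+1} = \mathrm{prox}_{\sigma_y F_2^*}(y_2^k + \sigma_y \nabla_{occ} \bar{c}^k)$

$c^{k+1} = \mathrm{prox}_{\sigma_c G}(c^k - \sigma_c(\nabla^T y_1^{k+1} + \nabla_{occ}^T y_2^{k+1}))$

$\bar{c}^{k+1} = c^{k+1} + \theta(c^{k+1} - c^k)$.


The indicator function (not to be confused with the
characteristic function $\mathbb{1}$ used above) of a set $A$ is defined by
$I_A(x) = 0$ for $x \in A$ and $I_A(x) = \infty$ for $x \notin A$. For
a function $f$, the convex conjugate is defined as $f^*(y) =
\sup_x y^T x - f(x)$. The prox operator of $f$ is defined as

$$\mathrm{prox}_{\sigma f}(y) = \arg\min_x f(x) + \frac{1}{2\sigma} \|x - y\|^2. \qquad (14)$$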

Since the optimization is performed on a finite pixel grid,
the depth values $c$ can be written as a vector $c \in \mathbb{R}^n_+$, with
$c_i$ indicating the layer value at the $i$-th pixel. We denote
by $\nabla$ the gradient operator represented by a matrix of finite
differences. Weights associated with the edges are denoted
by the diagonal matrix $W$. A difference matrix for occlusion
constraints is denoted by $\nabla_{occ}$ and the cost of violating
constraints by $\lambda$. As before, $\tau$ is used for regularization and
$\kappa$ is a weighted indicator of the foreground region. We can
then write the objective in shorthand as

$$\min_c \|W \nabla c\|_1 + \tau^T c + \kappa^T \max(0, 1 - c) + \lambda^T \max(0, 1 - \nabla_{occ} c) + I_{\{c \geq 0\}}(c). \qquad (15)$$

Let $G(c) = \tau^T c + \kappa^T \max(0, 1 - c) + I_{\{c \geq 0\}}(c)$. Also, let
$z_1 = \nabla c$, $z_2 = \nabla_{occ} c$, construct the functions $F_1(z_1) =
\|W z_1\|_1$, $F_2(z_2) = \lambda^T \max(0, 1 - z_2)$, and introduce the
dual variables $y_1, y_2$. The augmented Lagrangian follows as

$$\min_{z_1, z_2, c} \max_{y_1, y_2} F_1(z_1) + F_2(z_2) + G(c) + y_1^T(\nabla c - z_1) + y_2^T(\nabla_{occ} c - z_2), \qquad (16)$$

or, equivalently, using the convex conjugates, as

$$\min_c \max_{y_1, y_2} G(c) - F_1^*(y_1) - F_2^*(y_2) + y_1^T \nabla c + y_2^T \nabla_{occ} c.$$

This  saddle-point  problem  is  addressed  in  [9],  so  we  can 
apply  their  primal-dual  algorithm  shown  in  Alg.  1. 

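Alg. 1 depends on the ability to compute proximal operators
for $G$, $F_1^*$ and $F_2^*$. All three operators have simple
closed form solutions that require few arithmetic operations:

$$\mathrm{prox}_{\sigma G}(y) = \max(0, \min(y - \sigma\tau + \sigma\kappa, \max(1, y - \sigma\tau))) \qquad (17)$$

$$\mathrm{prox}_{\sigma F_1^*}(y) = \mathrm{sign}(y) \min\{\mathrm{diag}(W), |y|\} \qquad (18)$$

$$\mathrm{prox}_{\sigma F_2^*}(y) = \min(\max(y - \sigma \mathbf{1}, -\lambda), 0) \qquad (19)$$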

Derivation  details  are  reported  in  the  appendix. 
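The primal-dual iteration and its three proximal operators can be exercised on a tiny one-dimensional problem. The sketch below is not the paper's implementation: the function names, the scalar $\tau$, the encoding of cues as `(occluder_index, occluded_index, lam)`, and the step-size choice (based on the bound $\|K\|^2 \leq \|K\|_1 \|K\|_\infty$ for the stacked difference operator $K$) are assumptions made for illustration.

```python
import math


def clip(value, low, high):
    return max(low, min(high, value))


def solve_layers(g, tau, kappa, cues, iters=10000, theta=1.0):
    """Primal-dual iteration for a 1-D layer problem with n >= 2 pixels.

    g: TV weight per neighboring pair (length n - 1); kappa: per pixel.
    cues: list of (occluder_index, occluded_index, lam).
    """
    n = len(kappa)
    # Each row of K has two entries of magnitude 1, so ||K||_inf = 2 and
    # ||K||_1 is the largest number of rows touching a single pixel.
    degree = [0] * n
    for i in range(n - 1):
        degree[i] += 1
        degree[i + 1] += 1
    for a, b, _ in cues:
        degree[a] += 1
        degree[b] += 1
    sigma_y = sigma_c = 0.99 / math.sqrt(2.0 * max(degree))

    c = [0.0] * n
    c_bar = list(c)
    y1 = [0.0] * (n - 1)
    y2 = [0.0] * len(cues)
    for _ in range(iters):
        # Dual step for TV: project onto the box |y1_i| <= g_i.
        for i in range(n - 1):
            y1[i] = clip(y1[i] + sigma_y * (c_bar[i + 1] - c_bar[i]), -g[i], g[i])
        # Dual step for occlusion hinges: shift by sigma, clip to [-lam, 0].
        for j, (a, b, lam) in enumerate(cues):
            moved = y2[j] + sigma_y * (c_bar[a] - c_bar[b])
            y2[j] = clip(moved - sigma_y, -lam, 0.0)
        # Adjoint of the stacked difference operator applied to (y1, y2).
        adjoint = [0.0] * n
        for i in range(n - 1):
            adjoint[i + 1] += y1[i]
            adjoint[i] -= y1[i]
        for j, (a, b, _) in enumerate(cues):
            adjoint[a] += y2[j]
            adjoint[b] -= y2[j]
        # Primal step: closed-form prox of G, applied per pixel.
        c_new = []
        for i in range(n):
            shifted = c[i] - sigma_c * adjoint[i] - sigma_c * tau
            c_new.append(max(0.0, min(shifted + sigma_c * kappa[i],
                                      max(1.0, shifted))))
        c_bar = [new + theta * (new - old) for new, old in zip(c_new, c)]
        c = c_new
    return c


# Six pixels, a weak affinity between pixels 2 and 3, and one cue saying
# that pixel 3 occludes pixel 2.
solution = solve_layers(g=[1.0, 1.0, 0.1, 1.0, 1.0], tau=0.1,
                        kappa=[0.0] * 6, cues=[(3, 2, 5.0)])
```

On this toy input the minimizer of the relaxed problem is the integer labeling with pixels 3 to 5 in layer 1 and the rest in the background, which mirrors the empirical observation in the text that relaxed solutions tend to be piecewise constant and integer valued.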


6.  Experiments 

Our method segments video into depth layers. Unfortunately,
no benchmark dataset is available to evaluate it
directly. However, our method can be modified to produce
binary and multi-label segmentations; leveraging this, we
evaluate the algorithm on two datasets: MoSeg [23] (designed
for video object segmentation with no consideration
for depth ordering), on which we focus, as well as BVSD
[17] (designed for video segmentation).

Evaluation methodology. We follow the process described
in [23]. The dataset contains 59 sequences, ranging
from 19 to 800 frames. Each has pixel-wise ground
truth annotation for a sparse subset of frames (3–41). As in
[23], we report precision, recall, F-measure, and the number
of extracted objects (regions with F-measure $\geq 0.75$).
For multi-label segmentation tasks, we treat each connected
component of the depth layers as a unique “object”. We also
evaluate on foreground/background (FG/BG) video object
segmentation, where the masks come directly from depth layers as
FG $= \{x : c(x) \geq 1\}$, BG $= \{x : c(x) = 0\}$. Precision,
recall, and F-measure are reported on the ground truth annotations
converted to binary masks. Note we cannot evaluate
“number of extracted objects” in the FG/BG scenario.
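For the FG/BG setting, the conversion from depth layers to a binary mask and the three reported scores can be sketched as below. This is a plain pixel-counting version for a single frame; the function names are assumptions, and the benchmark protocol of [23] (which also handles region matching and averaging over sequences) is not reproduced here.

```python
def foreground_mask(layers):
    """FG = {x : c(x) >= 1}, BG = {x : c(x) = 0}, on a flat pixel list."""
    return [value >= 1 for value in layers]


def precision_recall_f(predicted, truth):
    """Pixel-wise precision, recall and F-measure of a binary mask."""
    true_pos = sum(1 for p, t in zip(predicted, truth) if p and t)
    false_pos = sum(1 for p, t in zip(predicted, truth) if p and not t)
    false_neg = sum(1 for p, t in zip(predicted, truth) if not p and t)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


mask = foreground_mask([0, 1, 2, 0])
scores = precision_recall_f(mask, [False, True, True, True])
```

In this example every predicted foreground pixel is correct but one true foreground pixel is missed, giving high precision and lower recall, the same pattern the “BASIC” rows show in the binary evaluation.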

The methods we compare against ([18, 22, 24, 23]) are
non-causal and “batch”, whereas our method is causal. Since
we do not know the future, we do not detect objects until
they undergo sufficient motion, which sometimes causes us
to miss objects in the beginning of video sequences. To fairly
compare against non-causal methods, we also perform a
non-causal evaluation (reported as “NC”): we run our algorithm
forward in time to accumulate all priors, and then backward
in time. The latter half is used for evaluation.

Effects of system components. In Sec. 3 we described
individual components of our model and showed examples
where they improved results (see Fig. 4, 5). Here we quantify
this improvement. We evaluate [3] (“BASIC”), their
temporal extension (“TE”), foreground-background prior
(Sec. 3.1, “FG”), and the full model (“FULL”). In addition,
we evaluated the full model without flow extrapolation (Sec.
4.2) to understand its effects (“NOFE”). These results are
reported in Table 1. “BASIC” does not use long-term temporal
information. “TE” integrates weights using previous
segmentations, increasing the cost of making a cut away
from object boundaries. “FG” discourages previously segmented
regions from falling into background. “FULL” is a
combination of all components.

The “BASIC” method does not use temporal information,
so on the multi-label benchmark, whenever objects disappear
(as they often do, due to insufficient motion) and re-appear,
they are assigned a new object label. Long-term integration
helps avoid missed detections and propagates object
labels throughout the sequence. Performance on the FG/BG
evaluation suggests that objects are often not detected at all.


Figure 7: (a-b) Comparison on MoSeg: (a) multi-label segmentation (legend: Grundmann10, Brox14, TE, FULL, FULL-NC), (b) FG/BG segmentation (legend: Grauman11, Papazoglou13, TE, FULL, FULL-NC), (c) Comparison on BVSD.


Multi-label segmentation

           Training set (29 sequences)      Test set (30 sequences)
Method     P       R       F       N/65     P       R       F       N/69
BASIC      84.90   53.10   65.34   10       78.80   44.49   56.87   4
TE         87.20   59.60   70.81   17       79.64   50.73   61.98   7
FG         86.98   60.99   71.71   18       79.04   52.08   62.79   10
NOFE       86.67   58.06   69.54   14       80.71   50.64   62.24   8
FULL       85.00   67.99   75.55   21       82.37   58.37   68.32   17
FULL-NC    83.00   70.10   76.01   23       77.94   59.14   67.25   15
[18]       79.17   47.55   59.42   4        77.11   42.99   55.20   5
[23]       81.50   63.23   71.21   16       74.91   60.14   66.72   20

Binary segmentation

           Training set (29 sequences)      Test set (30 sequences)
Method     P       R       F       -        P       R       F       -
BASIC      89.99   40.86   56.21   -        93.21   33.69   49.49   -
TE         75.94   61.64   68.05   -        78.11   54.68   64.33   -
FG         75.93   63.07   68.91   -        76.97   56.16   64.94   -
NOFE       68.92   66.09   67.48   -        74.27   53.99   62.52   -
FULL       83.92   68.19   75.24   -        86.54   63.20   73.05   -
FULL-NC    79.26   78.99   79.12   -        83.41   67.91   74.87   -
[22]       64.86   52.70   58.15   -        62.32   55.97   58.97   -
[24]       71.34   70.66   71.00   -        76.29   63.29   69.18   -

Table  1:  Comparison  of  our  approach  (rows  4-5)  to  baselines 
using  individual  components  (rows  1-3)  and  state-of-the-art 
(rows  6-7)  on  the  MoSeg  dataset.  R=recall,  P=precision, 
F=F-measure,  N=  number  of  extracted  objects. 
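
As a reading aid for Table 1, the sketch below recomputes the F column from the P and R columns. It assumes that F is the harmonic mean of the tabulated precision and recall; the source does not state the formula, but this assumption reproduces the rows checked here. The function name `f_measure` is ours.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Row "BASIC", multi-label segmentation, training set: P = 84.90, R = 53.10
print(round(f_measure(84.90, 53.10), 2))  # 65.34
# Row "FULL", multi-label segmentation, training set: P = 85.00, R = 67.99
print(round(f_measure(85.00, 67.99), 2))  # 75.55
```

Because the harmonic mean is dominated by the smaller of the two values, the low recall of “BASIC” on the binary task (40.86) caps its F-measure despite a precision of 89.99.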

Precision  decreases  for  the  “FULL”  system  due  to  an 
increased  number  of  “false  positives” — often  we  detect  more 
objects  than  labeled  in  the  annotation  (see  Fig.  8).  “NC” 
provides  a  small  performance  boost  by  allowing  us  to  label 
objects  before  they  move. 

Video object segmentation. In Table 1 and Fig. 7 we
report results of the comparison with multi-label dense motion
segmentation [23], video over-segmentation [18], as
well as binary (i.e. FG/BG) video object segmentation methods
[22, 24]. On multi-label segmentation, we outperform
[18], [3], and [23] in F-measure. The improvement over
the latter is small; however, note that unlike theirs, our
method is causal and has a small memory footprint. We
are not the best in terms of “number of extracted objects”:
as mentioned before, unless the object undergoes sufficient
motion, it will not be detected. On FG/BG segmentation, we
outperform [22], [24], and [3].

Video segmentation. BVSD [30, 17] contains 40 training
and 60 testing sequences, each up to 121 frames. Pixelwise
ground truth annotation is provided for a subset of
frames. Video sequences are in HD; we resize images to
540 × 960. While we report results for a variety of algorithms
[11, 1, 16, 34] (with data from [17]), our primary point of
comparison is [23]. Performance is benchmarked using
“boundary precision-recall” (BPR) and “volume precision-recall”
(VPR) metrics. BPR is commonly used in image segmentation,
while VPR quantifies the spatiotemporal overlap
between machine-generated and ground-truth segmentations
(see [17] for details).

Video  object  segmentation  algorithms  are  expected  to  be 
in  the  high-precision  regime  in  BPR,  and  in  the  high-recall 
regime  in  VPR,  which  indeed  both  we  and  [23]  satisfy  (see 
Fig. 7). We obtain (P, R, F) = (0.760, 0.186, 0.299) and
(0.136, 0.870, 0.234)  on  BPR  and  VPR  respectively,  while 
they  obtain  (0.566, 0.100, 0.170)  and  (0.146, 0.852, 0.249). 
Sample  results  are  in  Fig.  9.  Note  that  the  ground  truth  is  of¬ 
ten  fine-grained — with  objects  spanned  by  multiple  regions. 
Thus,  on  this  benchmark,  object  segmentation  methods  will 
not  obtain  the  best  F-measure. 

Timing.  Given  optical  flow  (which  video  segmentation 
often  requires  as  input),  our  algorithm  takes  30s  for  VGA 
images  on  a  standard  desktop;  most  of  the  time  is  spent 
solving  (11),  but  a  GPU  implementation  can  reduce  this. 

7.  Discussion 

Occlusion  relations  inform  the  partition  of  the  image 
domain  into  segments,  but  proper  inference  of  such  rela¬ 
tions  requires  knowledge  of  the  segments  in  turn.  Rather 
than  tackling  an  intractable  chicken-and-egg  problem,  we 
use  priors  informed  by  Gestalt  principles  to  arrive  at  a  con¬ 
vex  optimization  scheme  that  can  be  efficiently  solved  with 
primal-dual  methods.  To  compare  with  existing  bench¬ 
marks,  we  converted  our  layers  into  “objects”  and  into  “fore¬ 
ground/background”.  The  evaluation  highlights  strengths 
and  limitations  of  our  method,  with  some  of  the  latter  due 
to  the  particular  characteristics  of  the  benchmarks.  While 
our  scheme  still  relies  on  decent  optical  flow  and  occlusion 
detection  to  bootstrap  layer  segmentation,  it  is  less  prone  to 
cascading  failure  than  previous  methods,  as  it  better  exploits 
priors  on  motion,  appearance,  and  layer  consistency. 

Figure 8: Sample results on MoSeg. Left to right: original image, ground truth, [22], [24], [18], [23], ours (object maps).
Top two rows: occlusion cues allow us to obtain even the barely visible cars (row 1: orange; row 2: red and green).
Row 3: use of both motion and appearance cues allows us to generate an accurate object boundary. Row 4: occlusion cues
yield three depth layers (bicyclist, tree, background) (see also Fig. 1). Notice that the tree (and some cars in rows 1-2) is not
annotated, so our scheme is penalized despite providing the correct answer. [22, 24] suffer from trailing and only produce
binary segmentation. [18] suffers from oversegmentation. [23] performs comparably to our method. The last two rows show
failure cases. Row 5: the painting is recognized as an “object” due to false occlusion detection; the hand is assigned to a
separate layer. Row 6: the lioness is missed due to insufficient motion and lack of occlusions.


Figure 9: Sample results on BVSD. Left to right: original image, ground truth, [18], [23], ours (object maps). Row 1: as
reflected by BPR, our method produces accurate boundaries (see Fig. 7). Both actors are correctly segmented: the arm
occluding the animal’s body is a distinct depth layer. Row 2: failure case: complex motion and inaccurate flow can result in
inaccurate segmentations. Row 3: failure case: the object is not detected throughout the sequence due to lack of occlusions.

A. Additional results


In  the  main  paper,  we  reported  results  on  the  BVSD  dataset.  The  dataset  contains  three  “subtasks” — “motion”,  “camera 
motion”,  and  “non-rigid  motion”  (details  in  [17]).  Results  on  these  subtasks  are  reported  below  in  Fig.  10  and  Table  2. 


Figure 10: Results on BVSD subtasks; from left to right: “motion”, “camera motion”, “nonrigid motion”.

                   BPR                        VPR
Subtask            P       R       F          P       R       F
Motion             58.07   33.50   42.48      21.93   85.04   34.87
Camera motion      76.65   18.87   30.28      14.99   86.28   25.55
Nonrigid motion    53.71   40.86   46.42      27.20   84.45   41.12
General            76.02   18.65   29.95      13.56   87.09   23.46

Table 2: Precision-recall statistics of our method on BVSD subtasks. The “General” subtask contains all sequences and was
reported in the main paper.


B.  Optimization  details 


In  this  section  we  derive  the  prox-operators  used  by  the  primal-dual  algorithm. 


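B.1. Proximal operator for $G(c) = \tau^T c + \kappa^T \max(0, 1 - c) + I_{\{c \geq 0\}}(c)$

First define $g_i(c_i) = \tau_i c_i + \kappa_i \max(0, 1 - c_i) + I_{\{c_i \geq 0\}}(c_i)$, so that $G(c) = \sum_{i=1}^{n} g_i(c_i)$. This is done in anticipation of
the prox-operator for $G$ being separable¹:

\begin{align}
\mathrm{prox}_{\sigma G}(y) &= \arg\min_{c \geq 0} \frac{1}{2\sigma}\|c - y\|_2^2 + \tau^T c + \kappa^T \max(0, 1 - c) \tag{20}\\
&= \arg\min_{c \geq 0} \sum_{i=1}^{n} \frac{1}{2\sigma}(c_i - y_i)^2 + \tau_i c_i + \kappa_i \max(0, 1 - c_i) \tag{21}\\
&= \left(\mathrm{prox}_{\sigma g_1}(y_1), \ldots, \mathrm{prox}_{\sigma g_n}(y_n)\right) \tag{22}
\end{align}

We now derive the expression for $\mathrm{prox}_{\sigma g_i}(y_i)$. Since the general form of the expression is the same for each $i$, we
drop the index for clarity. For convenience, let $f(c) = \frac{1}{2\sigma}(c - y)^2 + g(c)$ (as shown in Fig. 11, $f$ is a parabola with a kink), i.e.
$\mathrm{prox}_{\sigma g}(y) = \arg\min_c f(c)$. Since $f$ is strictly convex and the minimizer is unique, we initially ignore the constraint $\{c \geq 0\}$,
eventually projecting the minimizer back onto the feasible domain if necessary. The subdifferential of $f(c)$ is

\[
\partial f(c) =
\begin{cases}
\frac{1}{\sigma}(c - y) + \tau & \text{if } c > 1 \\
\frac{1}{\sigma}(c - y) + \tau - \kappa & \text{if } 0 < c < 1 \\
\frac{1}{\sigma}(c - y) + \tau - \kappa[0, 1] & \text{if } c = 1
\end{cases}
\tag{23}
\]

The optimality condition $0 \in \partial f(c^*)$ is satisfied when

\[
c^* =
\begin{cases}
y - \sigma\tau & \text{if } y > 1 + \sigma\tau \\
1 & \text{if } y \in [1 + \sigma(\tau - \kappa),\, 1 + \sigma\tau] \\
\max\left(0,\, y - \sigma(\tau - \kappa)\right) & \text{if } y < 1 + \sigma(\tau - \kappa)
\end{cases}
\tag{24}
\]

These three cases are shown in Fig. 11, and (24) can be rewritten as

\[
\mathrm{prox}_{\sigma g_i}(y_i) = \max\left(0,\, \min\left(y_i - \sigma(\tau_i - \kappa_i),\, \max(1,\, y_i - \sigma\tau_i)\right)\right). \tag{25}
\]

Using (22) and interpreting max/min componentwise, the prox-operator for $G$ is written as

\[
\mathrm{prox}_{\sigma G}(y) = \max\left(0,\, \min\left(y - \sigma(\tau - \kappa),\, \max(1,\, y - \sigma\tau)\right)\right). \tag{26}
\]

Figure 11: Top: $f(c)$ shown for the three cases described in (24), from left to right: $y > 1 + \sigma\tau$, $y \in [1 + \sigma(\tau - \kappa), 1 + \sigma\tau]$, and $y < 1 + \sigma(\tau - \kappa)$. Bottom: subdifferentials of the corresponding $f(c)$. Shown in
red are the pairs $(c^*, f(c^*))$.

¹ We use the following “separable sum” rule. If $f\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = g(x_1) + h(x_2)$, then $\mathrm{prox}_f\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \begin{bmatrix} \mathrm{prox}_g(x_1) \\ \mathrm{prox}_h(x_2) \end{bmatrix}$.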
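
The closed form in (25)-(26) can be checked numerically. The following is a minimal sketch, not the implementation used for the experiments: it evaluates the componentwise formula and compares it against a brute-force grid search over the scalar objective $\frac{1}{2\sigma}(c - y)^2 + \tau c + \kappa \max(0, 1 - c)$ on $c \geq 0$. It assumes $\sigma > 0$ and $\kappa \geq 0$; the function names and the grid parameters are illustrative choices of ours.

```python
def prox_g_scalar(y, tau, kappa, sigma):
    """Closed form (25) for g(c) = tau*c + kappa*max(0, 1 - c) + I{c >= 0}."""
    return max(0.0, min(y - sigma * (tau - kappa), max(1.0, y - sigma * tau)))


def prox_G(y, tau, kappa, sigma):
    """Componentwise application of (25), i.e. equation (26), on plain lists."""
    return [prox_g_scalar(yi, ti, ki, sigma) for yi, ti, ki in zip(y, tau, kappa)]


def brute_force_prox(y, tau, kappa, sigma, steps=20001, c_max=10.0):
    """Grid search for the minimizer over c in [0, c_max] (sanity check only)."""
    best_c, best_f = 0.0, float("inf")
    for k in range(steps):
        c = c_max * k / (steps - 1)
        f = (c - y) ** 2 / (2.0 * sigma) + tau * c + kappa * max(0.0, 1.0 - c)
        if f < best_f:
            best_c, best_f = c, f
    return best_c


# One value of y per regime of (24), with tau = 0.5, kappa = 2, sigma = 1:
# thresholds are 1 + sigma*(tau - kappa) = -0.5 and 1 + sigma*tau = 1.5.
print(prox_G([3.0, 0.5, -1.0, -3.0], [0.5] * 4, [2.0] * 4, 1.0))
# [2.5, 1.0, 0.5, 0.0]
```

The last entry illustrates the projection step: the unconstrained minimizer is negative there, so the result is clamped to zero.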


B.2. Proximal operator for $F_1^*(y)$, with $F_1(y) = \|W y\|_1$

We note that this operator appears in all problems that involve TV-regularization, that our derivation below is by no means
novel, and that it is included only for completeness. Since the function is a (weighted) norm, we expect its convex conjugate
(i.e. $F_1^*$) to be a (weighted) indicator of a dual norm ball. Moreover, we expect the proximal operator of the indicator function
to be the projection onto the feasible set [6]. The steps below verify it.

\begin{align}
F_1^*(z) &= \sup_y\; y^T z - F_1(y) \tag{27}\\
&= \sum_i \max_{y_i}\; \underbrace{y_i z_i - w_i |y_i|}_{g_i(y_i)} \tag{28}
\end{align}

The maximum of each term $g_i(y_i)$ can be determined as

\begin{align}
\max_{y_i} g_i(y_i) &= w_i \left( \max_{y_i}\; y_i \frac{z_i}{w_i} - |y_i| \right) \tag{29}\\
&= \begin{cases} 0 & \text{if } \left|\frac{z_i}{w_i}\right| \leq 1 \\ +\infty & \text{else} \end{cases} \tag{30}
\end{align}

so $F_1^*(z)$ is the indicator of the weighted $\ell_\infty$ ball:

\[
F_1^*(z) = \begin{cases} 0 & \text{if } \|W^{-1} z\|_\infty \leq 1 \\ +\infty & \text{else} \end{cases} \tag{31}
\]

Using $\mathrm{diag}(W)$ to write the main diagonal as a vector (i.e. $\mathrm{diag}(W) = (W_{11}, \ldots, W_{nn})$), the prox operator for $F_1^*$ can be
written as

\begin{align}
\mathrm{prox}_{\sigma F_1^*}(y) &= \arg\min_z \frac{1}{2\sigma}\|z - y\|_2^2 + F_1^*(z) \tag{32}\\
&= \mathrm{sign}(y) \min\{\mathrm{diag}(W), |y|\}. \tag{33}
\end{align}
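
To make the projection in (33) concrete, here is a minimal sketch on plain Python lists. It assumes the weights $w_i = W_{ii}$ are nonnegative; the function name `prox_F1_conj` is ours. Since $F_1^*$ is an indicator function, the step size $\sigma$ does not appear in the result.

```python
import math


def prox_F1_conj(y, w):
    """Project y onto the box {z : |z_i| <= w_i}: sign(y_i) * min(w_i, |y_i|)."""
    return [math.copysign(min(wi, abs(yi)), yi) for yi, wi in zip(y, w)]


# Entries inside the box are unchanged; entries outside are clipped to +/- w_i.
print(prox_F1_conj([3.0, -0.2, -5.0], [1.0, 1.0, 2.0]))  # [1.0, -0.2, -2.0]
```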

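B.3. Proximal operator for $F_2^*(y)$, with $F_2(y) = \lambda^T \max(0, 1 - y)$

The convex conjugate is

\begin{align}
F_2^*(z) &= \max_y\; z^T y - F_2(y) \tag{34}\\
&= \sum_{i=1}^{n} \max_{y_i}\; \underbrace{y_i z_i - \lambda_i \max(0, 1 - y_i)}_{g_i(y_i)}. \tag{35}
\end{align}

Notice that when $z_i > 0$, $\max_{y_i} g_i(y_i) \to \infty$ (achieved with $y_i \to \infty$) and similarly $\max_{y_i} g_i(y_i) \to \infty$ when $z_i < -\lambda_i$
(achieved with $y_i \to -\infty$). These cases are shown in Fig. 12. So, $F_2^*(z) = \infty$ for $z \notin [-\lambda, 0]$. For $z \in [-\lambda, 0]$ we compute
the subdifferential of $g_i(y_i)$ (both shown in Fig. 12):

\[
\partial g_i(y_i) =
\begin{cases}
z_i + \lambda_i & y_i < 1 \\
z_i + \lambda_i [0, 1] & y_i = 1 \\
z_i & y_i > 1
\end{cases}
\tag{36}
\]

The optimality condition $0 \in \partial g_i(y_i^*)$ suggests that when $z_i \in (-\lambda_i, 0)$, $y_i^* = 1$. In other words, for that interval, we have
$\max_{y_i} g_i(y_i) = z_i$. On the interval boundaries $y_i^*$ is not unique (when $z_i = -\lambda_i$, $y_i^* \in (-\infty, 1]$, and when $z_i = 0$, $y_i^* \in [1, \infty)$),
but $\max_{y_i} g_i(y_i) = z_i$ still holds. So the convex conjugate can be written as follows:

\[
F_2^*(z) = \begin{cases} \mathbf{1}^T z & \text{if } z \in [-\lambda, 0] \\ +\infty & \text{else} \end{cases} \tag{37}
\]

Figure 12: Left: $g(y)$ plotted for $z \notin [-\lambda, 0]$. As noted in text, these functions are unbounded. Middle: $g(y)$ with $z \in (-\lambda, 0)$.
This is a piecewise linear function with a unique maximizer (if $z \in \{-\lambda, 0\}$, the function is bounded, but the maximizer is not
unique). Right: subdifferential of $g(y)$ on the left.

To solve for $\mathrm{prox}_{\sigma F_2^*}(z)$, we first find $\arg\min_x \frac{1}{2\sigma}\|z - x\|_2^2 + \mathbf{1}^T x$ (ignoring the constraint $x \in [-\lambda, 0]$), and then project
it onto the feasible set:

\begin{align}
\mathrm{prox}_{\sigma F_2^*}(z) &= \arg\min_x \frac{1}{2\sigma}\|z - x\|_2^2 + F_2^*(x) \tag{38}\\
&= \min\left(\max\left(z - \sigma\mathbf{1},\, -\lambda\right),\, 0\right). \tag{39}
\end{align}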
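
The two steps behind (39), a shift by $\sigma$ followed by clipping to $[-\lambda_i, 0]$, can be written in a few lines. This is a minimal sketch on plain lists, assuming $\lambda_i \geq 0$ and $\sigma > 0$; the function name `prox_F2_conj` is ours.

```python
def prox_F2_conj(z, lam, sigma):
    """Equation (39): shift each entry by sigma, then clip to [-lam_i, 0]."""
    return [min(max(zi - sigma, -li), 0.0) for zi, li in zip(z, lam)]


# With lam = 1 and sigma = 0.25: a positive entry is clipped to 0, an interior
# entry is only shifted, and a very negative entry is clipped to -lam.
print(prox_F2_conj([0.5, -0.25, -4.0], [1.0, 1.0, 1.0], 0.25))  # [0.0, -0.5, -1.0]
```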


B.4. Layer unity prior and aggregated TV weights

In section 3.2 we claimed that enforcing $\nabla c(x) = 0$ with a hinge loss penalty and associated cost $u(x)$ (nonzero for $x$
where we want to enforce the constraint, and zero elsewhere) is equivalent to increasing weights in TV regularization. We
prove this claim here.

$\nabla c(x) = 0 \iff |\nabla c(x)| \leq 0$. Relaxed as a hinge loss penalty, this becomes

\[
\int_D u(x) \max\left(0, |\nabla c(x)|\right) dx = \int_D u(x) |\nabla c(x)|\, dx. \tag{40}
\]

But this is just TV regularization, and the objective already includes a penalty of the same form with weights $g(x)$. We
conclude that the new “biased” penalty is $\int_D g'(x) |\nabla c(x)|\, dx$ with $g'(x) = g(x) + u(x)$.

C.  Flow  extrapolation  via  cross  bilateral  filter 

In this section, we show additional examples of the result of flow extrapolation in the occluded region, which allows us to
associate occluder points to occluded points. In each figure, the estimated motion field and the extrapolated motion
field are shown, below which regions of notable change are shown in individual panels for closer inspection. In most
cases, the extrapolation step locally improves optical flow.

For  the  car  in  Fig.  13,  panels  A-F  show  signs  of  improvement,  where  the  flow  in  the  occluded  region  becomes  more  similar 
to the “locally background” object. We note specific improvements as follows: A: the rear-view mirror is obtained. B: the
motion  in  the  region  to  the  left  of  the  car  windshield  becomes  more  similar  to  the  background  motion.  C:  incorrect  optical 
flow  in  front  of  the  car  is  fixed.  D-E:  most  of  the  occluded  region’s  motion  (in  front  of  and  below  the  car)  becomes  more 
similar  to  the  background’s.  F:  the  flow  of  the  car  wheel  is  independent  of  the  vehicle  motion,  but  is  smoothed  out.  For  the 
task  of  object  segmentation  at  this  scale,  this  is  desired  behaviour. 

For the horse in Fig. 13, panels A-C show improvement. Box D is a failure example: extrapolated flow is worse than the
original. Box E shows negligible improvement. Details follow: A: optical flow in the region around the head becomes similar
to background. B: the erroneous motion in front of the horse’s nostril is removed. C: motion in front of the horse’s neck
is cleaned up. D: the horse’s left front leg is smoothed out, as its motion and appearance are similar to
the background’s. When this occurs, the flow smooths across the weak flow boundaries. E: the details on the tail improve slightly.


Figure 13: Two examples comparing the original optical flow field (top left) and the resulting flow post-extrapolation
(top right). Below each pair are panels highlighting regions where significant change occurred. For each box, the original flow is
shown in the top row while the extrapolated result is shown below it.


C.1. Constraint perturbation

The ability to compute occluder-occluded pairs $(x, y)$ reliably requires accurate optical flow, which may be difficult to
obtain, especially in regions with large motion. Inaccurate flow may lead to an incorrect constraint (e.g. both points falling on
the occluded region). To be robust against failure of optical flow, we assume that depth layers can be locally discriminated by
their appearance, and locally perturb the constraints to ensure that both points fall on their respective sides of the occlusion
boundary.

Specifically, we model the local appearance of local foreground (occluder) and local background (occluded) regions by a
pair of Gaussian mixture models (GMM) with respective likelihoods $f_F$ and $f_B$. Learning these models would be simple
if we could obtain a set of samples from F and B regions. Since these are not known, we estimate $f_F$ and $f_B$ from nearby
“occluder” and “occluded” points. We found that this approach works well in practice, despite the implicit assumption that
the majority of these points fall into correct regions. To measure how well a point matches the appearance of occluder and
occluded regions, we introduce likelihood ratios $\phi_F(x) = \log \frac{f_F(x)}{f_B(x)}$ and $\phi_B(x) = \log \frac{f_B(x)}{f_F(x)}$. If $\phi_F(x) > 0$ (resp. $< 0$), a
point is likely in the foreground region (resp. in background region).

We would like to transform the pair $(x, y)$ to ensure that the occluder point $x$ lies in the foreground region and that the occluded point $y$ lies in the background
region, while penalizing large transformations. To keep the procedure tractable, we restrict the transformation to be
parameterized by translation, rotation, and uniform scale (i.e. a similarity). Formally, we can write down an optimization
problem:

\[
\max_g\; \phi_F(g \circ x) + \phi_B(g \circ y) - \|g\|, \quad \text{with } g \text{ a similarity transformation} \tag{41}
\]

where we use the notation $g \circ x$ to denote a point $x$ transformed by $g$. The first two terms measure how well the image
intensities near transformed points match the appearance models. The deformation penalty is denoted by $\|g\|$. Once the best
transformation $g^*$ is found, we can use the transformed constraint pair $(g^* \circ x, g^* \circ y)$. In practice, we compute the value of
(41) for multiple local transformations, and choose the best one. An example of this procedure is shown in Fig. 14.
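
The search described above, evaluating (41) for a finite set of candidate transformations and keeping the best, can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the experiments: the similarity is parameterized as a hypothetical tuple `(scale, angle, tx, ty)` applied about the midpoint of the pair, `deformation_penalty` is a stand-in for $\|g\|$ (the text does not specify its form), and `phi_F`, `phi_B` are caller-supplied functions returning the log-likelihood ratios at a point.

```python
import math


def apply_similarity(g, p, center):
    """Apply g = (scale, angle, tx, ty) to point p, rotating/scaling about center."""
    s, theta, tx, ty = g
    dx, dy = p[0] - center[0], p[1] - center[1]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (center[0] + s * (cos_t * dx - sin_t * dy) + tx,
            center[1] + s * (sin_t * dx + cos_t * dy) + ty)


def deformation_penalty(g, weight=1.0):
    """Hypothetical stand-in for ||g||: zero at the identity, grows with the change."""
    s, theta, tx, ty = g
    return weight * (abs(math.log(s)) + abs(theta) + math.hypot(tx, ty))


def perturb_constraint(x, y, phi_F, phi_B, candidates):
    """Return (score, g o x, g o y, g) for the candidate g maximizing (41)."""
    center = ((x[0] + y[0]) / 2.0, (x[1] + y[1]) / 2.0)
    best = None
    for g in candidates:
        gx = apply_similarity(g, x, center)
        gy = apply_similarity(g, y, center)
        score = phi_F(gx) + phi_B(gy) - deformation_penalty(g)
        if best is None or score > best[0]:
            best = (score, gx, gy, g)
    return best
```

For example, with a toy appearance model in which the foreground is the half-plane to the left of the vertical axis, a pair whose occluder point has landed on the background side is moved back across the boundary when a small translation is among the candidates, provided the gain in the first two terms of (41) outweighs the penalty.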

C.2.  Local  shape  classifiers 

In some image regions, poor motion estimation or excessive clutter can confuse the temporal integration and computation of
segmentation weights ($g_t(x)$ in (11)), as both are based on pixelwise differences; this can make determining object boundaries
difficult despite temporal integration. To reliably segment the objects from the background, we incorporate shape information
over a local region near the object boundaries. In a fashion similar to [4], we employ a set of overlapping localized shape
classifiers to help locally discriminate between foreground objects and the background. Each classifier learns an appearance
model and a shape model of the object within a small region, and then propagates this information to the next frame by locally
adjusting the strength of the foreground prior $\kappa_t$.

Figure 14: Constraint perturbation helps clean up poorly estimated constraints. Left: original constraints before perturbation.
Middle: improved constraints. Right: several occluder-occluded pixel pairs where perturbation ensures that the estimate of the
occluder's position lies on the foreground object (the local foreground probability map is shown under $\phi_F$).

After the first frame is partitioned, a set of local classifiers is instantiated along object contours to track the boundary of
the foreground object $o_{fg}$ within a small window $W$ (31 × 31 pixels). Local shape is determined by the mask of $o_{fg}$ in $W$,
denoted $m(x)$, and the local appearance of $o_{fg}$ is modeled by a GMM (3 modes). A confidence $c_{fg}$ on how well the color
model discriminates between $o_{fg}$ and background is computed as

\[
c_{fg} = 1 - \frac{\int_W |m(x) - p_c(x)|\, w_c(x)\, dx}{\int_W w_c(x)\, dx} \tag{42}
\]

where $p_c$ is the foreground probability computed from the appearance GMM and $w_c(x) = \exp(-d^2(x)/\sigma_c^2)$, which weights
the contribution of each pixel based on $d(x)$ (the distance between $x$ and the foreground-background boundary, computed
using a distance transform). $\sigma_c$ is set to half the window size (for us, 15 pixels) (see (2) in [4] for further details). Next, the
local classifier is warped forward by the motion of $o_{fg}$. Based on $c_{fg}$, appearance and shape models are adaptively combined to
provide a prior on which pixels in $W$ are likely to be foreground:

\[
p_{fg}(x) = c_{fg}\, p_c(x) + (1 - c_{fg})\, L(w_t^{t+1}(x)), \tag{43}
\]

where $L(w_t^{t+1}(x))$ is the binary mask $L(x)$ warped into the current frame. Finally, the contributions from each local
classifier are combined together and added to $\kappa_t(x)$ in (11).
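
A discrete version of (42) and (43) for a single window can be sketched as below. This is a minimal illustration under our own assumptions: the window is flattened into equal-length lists, the integrals become sums over pixels, and the warped mask is supplied by the caller (the warping itself is not shown). The function names are ours.

```python
import math


def classifier_confidence(mask, p_c, dist, sigma_c):
    """Discrete (42): 1 - sum(|m - p_c| * w_c) / sum(w_c), w_c = exp(-d^2 / sigma_c^2)."""
    num = 0.0
    den = 0.0
    for m, p, d in zip(mask, p_c, dist):
        w = math.exp(-(d * d) / (sigma_c * sigma_c))
        num += abs(m - p) * w
        den += w
    return 1.0 - num / den


def foreground_prior(c_fg, p_c, warped_mask):
    """Discrete (43): blend the appearance probability with the warped binary mask."""
    return [c_fg * p + (1.0 - c_fg) * l for p, l in zip(p_c, warped_mask)]


mask = [1.0, 1.0, 0.0, 0.0]   # local shape mask m(x) over a 4-pixel toy window
dist = [0.0, 1.0, 1.0, 2.0]   # distance of each pixel to the FG/BG boundary
flat = [0.5, 0.5, 0.5, 0.5]   # an uninformative appearance model
c_fg = classifier_confidence(mask, flat, dist, sigma_c=15.0)
print(foreground_prior(c_fg, flat, mask))
```

When the appearance model agrees with the mask the confidence is 1 and the prior follows the color model; as the color model becomes less discriminative the confidence drops and the warped shape mask contributes more.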

After  the  first  frame,  each  classifier  is  propagated  forward  until  its  support  contains  no  foreground  objects,  at  which  point  it 
is  dropped.  New  classifiers  are  instantiated  where  object  boundaries  in  the  current  frame’s  segmentation  are  not  covered  by  any 
local  classifiers.  See  [4]  for  further  details.  Generally,  we  find  that  these  classifiers  improve  the  segmentation  of  foreground 
objects  in  regions  where  a  strong  boundary  is  not  discernible  as  shown  in  Fig.  15.  In  addition,  for  small  objects  moving 
quickly,  these  classifiers  help  preserve  the  object  as  the  foreground  prior  can  sometimes  be  eroded  away  as  shown  in  Fig.  16. 


Figure 15: Two examples of the local shape classifiers boosting the foreground prior ($\kappa_t$) near object edges, where $\kappa_t$ is
computed without the classifiers on the left and incorporating them (with the classifier windows overlaid in light gray) on the
right. Notice how $\kappa_t$ better captures the shape of the woman’s hat behind the bush (left) and the legs of the horse (right).


Figure 16: Examples of local shape classifiers preserving small objects when the foreground prior term alone would erode
away. From left to right: $\kappa_t$ without local classifiers, the corresponding layer segmentation, $\kappa_t$ incorporating the classifiers
(with the windows drawn in light gray), and the resulting layers. The camel’s head and legs are better captured with the help of
local shape classifiers (row 1). Without them, much of the object may go missing (row 2) or even become lost altogether (rows
3-4). These local shape classifiers are beneficial to long-term temporal consistency.


